Document Description

This document includes the code, output, and interpretation of the analysis conducted on a dataset derived from the VERIS Community Database (VCDB), as outlined in the RiskLens Data Science Candidate Task found at: https://github.com/RiskLens/DataScience_Hiring_Task_2019.

Executive Summary

Description of Data: The exercise involved accessing a curated dataset, provided by RiskLens, that was derived from the VERIS Community Database (VCDB). The dataset contained incidents, when they occurred (day, month, and year), the entity that caused the incident, the type of action taken, the type of breach (confidentiality, integrity, or availability), the type and variety of assets that were compromised, the number of records breached, the victim count, and the location of the victims.

Summary of Findings from Exploratory Analysis: The number of incidents reported each year ranged from 56 to 395, with the peak occurring in 2013. Physical attacks were the most common action in 2010, representing 50% of incidents, but declined to only 12.5% of incidents by 2018. Conversely, hacking attacks represented only 4% of incidents in 2010 but increased to 29% by 2018. Furthermore, the largest breach, which contained 50,000,000 records, was the result of a hacking incident in 2016. There were no major changes in the type of actor behind the incidents, with the exception of an atypically large proportion of incidents from internal actors occurring in 2015. Between 2010 and 2018, incidents involving user devices decreased from 40% to 9%, while incidents involving servers increased from 0.5% to 18% and incidents involving unknown assets increased from 14% to 32%. There was no meaningful change in the proportion of confidentiality breaches from 2010 to 2018 (100% to 98%), but there was an increase in the proportion of integrity incidents (3% to 36%) and a decrease in the proportion of availability incidents (68% to 27%). The proportion of incidents that occurred outside the United States increased from 5% in 2010 to 23% in 2018, with a spike of 30% in 2017. There also appeared to be a shift in the seasonality of incidents from 2010 to 2018: the proportion of incidents in November and December decreased from 12% in each month to 2% and 0%, respectively, while the proportion of incidents in January increased from 1% to 12% over the same time frame.

Summary of Findings from Modeling Analysis: Assuming a large healthcare employer (1001 employees or larger) and an incident caused by an internal actor, it is estimated with 90% confidence that the number of medical records breached falls between 10,867 and 482,810 records, with 234,054 records being the most likely number of medical records breached during a given incident. With regard to total records (inclusive of medical records), it is estimated with 90% confidence that the number of total records breached during an incident falls between 10,522 and 476,917, with 29,388 records being the most likely number of total records breached during an incident. Initial analysis using a generalized linear model (GLM) found that fewer records were obtained with each breach over time (from 2010 to 2018). There was a single large outlier, a 1,055,489-record breach that occurred in 2011; this value was removed from the dataset and the GLM was refit. The model without the outlier also showed that the number of records breached was decreasing over time. Furthermore, removing the outlier improved the model fit substantially, reducing the Akaike information criterion from 13,176,462 to 581,860. However, given the overall model fit and the magnitude of the estimated decrease, the reduction in records breached over time should be interpreted cautiously and supplemented with information outside the model’s parameters.

Exploratory Analysis

The exploratory analysis portion of the exercise is organized as follows:

Step 1) Load necessary libraries

Step 2) Load the data

Step 3) Filter data based on specified criteria

Step 4) Conduct requested analysis and provide results and interpretation at each step

Step 5) Conduct additional analysis to enrich findings

Step 1) Load Necessary Libraries

library(reader)
library(knitr)
library(httr)
library(ggplot2)
library(kableExtra)
library(dplyr)
library(mc2d)
library(rcompanion)
  #Libraries for bonus wordle from summary data
library(RXKCD)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(tidyverse)
library(ciTools)
library(MASS) 
library(arm)

Step 2) Load Data

Data will be loaded directly from GitHub so that the analysis does not rely on data residing on a local machine.

  VCDB_RAW=read.csv("https://raw.githubusercontent.com/RiskLens/DataScience_Hiring_Task_2019/master/data/vcdb_medical_simplified.csv")

Step 3) Filter data based on specified criteria.

Task instructions: You should further filter the data to the years 2010-2018 (inclusive of both)

  #Filter the data to the years 2010-2018 (inclusive of both).
  VCDB=subset(VCDB_RAW, timeline.incident.year >= 2010 & timeline.incident.year <= 2018) 
 
   #Visual confirmation of successful filter
  ggplot(data=VCDB, aes(x=timeline.incident.year)) +geom_bar(fill = "darkred")          

  #Numerical confirmation of successful filter
  min(VCDB$timeline.incident.year)
## [1] 2010
  max(VCDB$timeline.incident.year)
## [1] 2018

Step 4) Complete an exploratory analysis of the data set and answer at least the following questions:

4A) How many total incidents are in the database for each year?

    #a. How many total incidents are in the database for each year?
          #Display Graphically Ordered Temporally
          ggplot(data=VCDB, aes(x=timeline.incident.year)) +geom_bar(fill = "darkred") +geom_text(stat='count', aes(label=..count..), vjust=-1)  

    #Print Numerical Table
          incident.by.year.table=table(VCDB$timeline.incident.year)
          kable(incident.by.year.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
Var1 Freq
2010 201
2011 162
2012 286
2013 395
2014 310
2015 316
2016 258
2017 197
2018 56

4C) Repeat step b but for actor.

          #Visually graph actor by years
          ggplot(aes(x = timeline.incident.year ), data = VCDB) + 
          geom_histogram(aes(fill = actor ), binwidth=1, colour="grey20", lwd=0.2) +
          scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))

          #Generate raw numbers
          actor.by.year.table=table(VCDB$timeline.incident.year,VCDB$actor)
           kable(actor.by.year.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
external internal partner unknown
2010 113 61 15 12
2011 83 63 6 10
2012 151 103 19 13
2013 162 200 22 11
2014 125 166 16 3
2015 94 211 8 3
2016 146 98 10 4
2017 102 84 9 2
2018 27 27 1 1
          #Generate proportions with new dataset
          actor.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$actor),1) #percentages by row 
          actor.by.year.table.prop.data=as.data.frame(actor.by.year.table.prop.table)
          actor.by.year.table.prop.data=actor.by.year.table.prop.data %>% rename( year=Var1, actor=Var2, proportion=Freq)
            #Print Proportions Table
          kable(actor.by.year.table.prop.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
external internal partner unknown
2010 0.5621891 0.3034826 0.0746269 0.0597015
2011 0.5123457 0.3888889 0.0370370 0.0617284
2012 0.5279720 0.3601399 0.0664336 0.0454545
2013 0.4101266 0.5063291 0.0556962 0.0278481
2014 0.4032258 0.5354839 0.0516129 0.0096774
2015 0.2974684 0.6677215 0.0253165 0.0094937
2016 0.5658915 0.3798450 0.0387597 0.0155039
2017 0.5177665 0.4263959 0.0456853 0.0101523
2018 0.4821429 0.4821429 0.0178571 0.0178571
          #Proportion Visualizations
              
              #Proportion bar charts
              actor.by.year.table.prop.data.bar.chart=ggplot(actor.by.year.table.prop.data, aes(x=year, y=proportion, fill=actor))+ geom_bar(stat="identity", position = "fill")
              actor.by.year.table.prop.data.bar.chart

              #individual line graphs
              actor.by.year.prop.line.graph<-ggplot(actor.by.year.table.prop.data, aes(x=year, y=proportion, group=actor)) +
              geom_line(aes(color=actor))+
              geom_point(aes(color=actor))
              print(actor.by.year.prop.line.graph)

              #Facet by actor to pull apart and visualize
              Facet.actor.by.year.prop.line.graph=actor.by.year.prop.line.graph+facet_wrap(. ~ actor,ncol=3)
              print(Facet.actor.by.year.prop.line.graph)

4D) Repeat step b but for asset

          #Visually graph asset by years
          ggplot(aes(x = timeline.incident.year ), data = VCDB) + 
          geom_histogram(aes(fill = asset ), binwidth=1, colour="grey20", lwd=0.2) +
          scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))

          #Generate raw numbers
          asset.by.year.table=table(VCDB$timeline.incident.year,VCDB$asset)
           kable(asset.by.year.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
kiosk/term media network person server unknown user dev
2010 0 0 83 0 1 29 8 80
2011 0 0 58 0 4 46 8 46
2012 0 1 90 1 7 74 18 95
2013 0 0 122 3 9 108 26 127
2014 1 0 99 1 15 88 28 78
2015 1 0 113 0 19 123 16 44
2016 1 1 78 0 27 120 10 21
2017 5 1 57 3 28 74 13 16
2018 1 1 19 0 10 18 2 5
          #Generate proportions with new dataset
          asset.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$asset),1) #percentages by row 
          asset.by.year.table.prop.data=as.data.frame(asset.by.year.table.prop.table)
          asset.by.year.table.prop.data=asset.by.year.table.prop.data %>% rename( year=Var1, asset=Var2, proportion=Freq)
            #Print Proportions Table
          kable(asset.by.year.table.prop.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
kiosk/term media network person server unknown user dev
2010 0.0000000 0.0000000 0.4129353 0.0000000 0.0049751 0.1442786 0.0398010 0.3980100
2011 0.0000000 0.0000000 0.3580247 0.0000000 0.0246914 0.2839506 0.0493827 0.2839506
2012 0.0000000 0.0034965 0.3146853 0.0034965 0.0244755 0.2587413 0.0629371 0.3321678
2013 0.0000000 0.0000000 0.3088608 0.0075949 0.0227848 0.2734177 0.0658228 0.3215190
2014 0.0032258 0.0000000 0.3193548 0.0032258 0.0483871 0.2838710 0.0903226 0.2516129
2015 0.0031646 0.0000000 0.3575949 0.0000000 0.0601266 0.3892405 0.0506329 0.1392405
2016 0.0038760 0.0038760 0.3023256 0.0000000 0.1046512 0.4651163 0.0387597 0.0813953
2017 0.0253807 0.0050761 0.2893401 0.0152284 0.1421320 0.3756345 0.0659898 0.0812183
2018 0.0178571 0.0178571 0.3392857 0.0000000 0.1785714 0.3214286 0.0357143 0.0892857
          #Proportion Visualizations
              
              #Proportion bar charts
              asset.by.year.table.prop.data.bar.chart=ggplot(asset.by.year.table.prop.data, aes(x=year, y=proportion, fill=asset))+ geom_bar(stat="identity", position = "fill")
              asset.by.year.table.prop.data.bar.chart

              #individual line graphs
              asset.by.year.prop.line.graph<-ggplot(asset.by.year.table.prop.data, aes(x=year, y=proportion, group=asset)) +
              geom_line(aes(color=asset))+
              geom_point(aes(color=asset))
              print(asset.by.year.prop.line.graph)

              #Facet by asset to pull apart and visualize
              Facet.asset.by.year.prop.line.graph=asset.by.year.prop.line.graph+facet_wrap(. ~ asset,ncol=3)
              print(Facet.asset.by.year.prop.line.graph)

4E) Repeat step b but for the 3 attribute variables

4E.1) Confidentiality
          #Visually graph attribute.confidentiality by years
          ggplot(aes(x = timeline.incident.year ), data = VCDB) + 
          geom_histogram(aes(fill = attribute.confidentiality ), binwidth=1, colour="grey20", lwd=0.2) +
          scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))

          #Generate raw numbers
          attribute.confidentiality.by.year.table=table(VCDB$timeline.incident.year,VCDB$attribute.confidentiality)
           kable(attribute.confidentiality.by.year.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
False True
2010 0 201
2011 0 162
2012 1 285
2013 7 388
2014 4 306
2015 2 314
2016 6 252
2017 0 197
2018 1 55
          #Generate proportions with new dataset
          attribute.confidentiality.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$attribute.confidentiality),1) #percentages by row 
          attribute.confidentiality.by.year.table.prop.data=as.data.frame(attribute.confidentiality.by.year.table.prop.table)
          attribute.confidentiality.by.year.table.prop.data=attribute.confidentiality.by.year.table.prop.data %>% rename( year=Var1, attribute.confidentiality=Var2, proportion=Freq)
            #Print Proportions Table
          kable(attribute.confidentiality.by.year.table.prop.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
False True
2010 0.0000000 1.0000000
2011 0.0000000 1.0000000
2012 0.0034965 0.9965035
2013 0.0177215 0.9822785
2014 0.0129032 0.9870968
2015 0.0063291 0.9936709
2016 0.0232558 0.9767442
2017 0.0000000 1.0000000
2018 0.0178571 0.9821429
          #Proportion Visualizations
              
              #Proportion bar charts
              attribute.confidentiality.by.year.table.prop.data.bar.chart=ggplot(attribute.confidentiality.by.year.table.prop.data, aes(x=year, y=proportion, fill=attribute.confidentiality))+ geom_bar(stat="identity", position = "fill")
              attribute.confidentiality.by.year.table.prop.data.bar.chart

              #individual line graphs
              attribute.confidentiality.by.year.prop.line.graph<-ggplot(attribute.confidentiality.by.year.table.prop.data, aes(x=year, y=proportion, group=attribute.confidentiality)) +
              geom_line(aes(color=attribute.confidentiality))+
              geom_point(aes(color=attribute.confidentiality))
              print(attribute.confidentiality.by.year.prop.line.graph)

              #Facet by attribute.confidentiality to pull apart and visualize
              Facet.attribute.confidentiality.by.year.prop.line.graph=attribute.confidentiality.by.year.prop.line.graph+facet_wrap(. ~ attribute.confidentiality,ncol=3)
              print(Facet.attribute.confidentiality.by.year.prop.line.graph)  

4E.2) Integrity
          #Visually graph attribute.integrity by years
          ggplot(aes(x = timeline.incident.year ), data = VCDB) + 
          geom_histogram(aes(fill = attribute.integrity ), binwidth=1, colour="grey20", lwd=0.2) +
          scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))

          #Generate raw numbers
          attribute.integrity.by.year.table=table(VCDB$timeline.incident.year,VCDB$attribute.integrity)
           kable(attribute.integrity.by.year.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
False True
2010 195 6
2011 151 11
2012 269 17
2013 367 28
2014 282 28
2015 277 39
2016 192 66
2017 138 59
2018 36 20
          #Generate proportions with new dataset
          attribute.integrity.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$attribute.integrity),1) #percentages by row 
          attribute.integrity.by.year.table.prop.data=as.data.frame(attribute.integrity.by.year.table.prop.table)
          attribute.integrity.by.year.table.prop.data=attribute.integrity.by.year.table.prop.data %>% rename( year=Var1, attribute.integrity=Var2, proportion=Freq)
            #Print Proportions Table
          kable(attribute.integrity.by.year.table.prop.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
False True
2010 0.9701493 0.0298507
2011 0.9320988 0.0679012
2012 0.9405594 0.0594406
2013 0.9291139 0.0708861
2014 0.9096774 0.0903226
2015 0.8765823 0.1234177
2016 0.7441860 0.2558140
2017 0.7005076 0.2994924
2018 0.6428571 0.3571429
          #Proportion Visualizations
              
              #Proportion bar charts
              attribute.integrity.by.year.table.prop.data.bar.chart=ggplot(attribute.integrity.by.year.table.prop.data, aes(x=year, y=proportion, fill=attribute.integrity))+ geom_bar(stat="identity", position = "fill")
              attribute.integrity.by.year.table.prop.data.bar.chart

              #individual line graphs
              attribute.integrity.by.year.prop.line.graph<-ggplot(attribute.integrity.by.year.table.prop.data, aes(x=year, y=proportion, group=attribute.integrity)) +
              geom_line(aes(color=attribute.integrity))+
              geom_point(aes(color=attribute.integrity))
              print(attribute.integrity.by.year.prop.line.graph)

              #Facet by attribute.integrity to pull apart and visualize
              Facet.attribute.integrity.by.year.prop.line.graph=attribute.integrity.by.year.prop.line.graph+facet_wrap(. ~ attribute.integrity,ncol=3)
              print(Facet.attribute.integrity.by.year.prop.line.graph)

4E.3) Availability
          #Visually graph attribute.availability by years
          ggplot(aes(x = timeline.incident.year ), data = VCDB) + 
          geom_histogram(aes(fill = attribute.availability ), binwidth=1, colour="grey20", lwd=0.2) +
          scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))

          #Generate raw numbers
          attribute.availability.by.year.table=table(VCDB$timeline.incident.year,VCDB$attribute.availability)
           kable(attribute.availability.by.year.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
False True
2010 65 136
2011 83 79
2012 130 156
2013 217 178
2014 183 127
2015 237 79
2016 162 96
2017 132 65
2018 41 15
          #Generate proportions with new dataset
          attribute.availability.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$attribute.availability),1) #percentages by row 
          attribute.availability.by.year.table.prop.data=as.data.frame(attribute.availability.by.year.table.prop.table)
          attribute.availability.by.year.table.prop.data=attribute.availability.by.year.table.prop.data %>% rename( year=Var1, attribute.availability=Var2, proportion=Freq)
            #Print Proportions Table
          kable(attribute.availability.by.year.table.prop.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
False True
2010 0.3233831 0.6766169
2011 0.5123457 0.4876543
2012 0.4545455 0.5454545
2013 0.5493671 0.4506329
2014 0.5903226 0.4096774
2015 0.7500000 0.2500000
2016 0.6279070 0.3720930
2017 0.6700508 0.3299492
2018 0.7321429 0.2678571
          #Proportion Visualizations
              
              #Proportion bar charts
              attribute.availability.by.year.table.prop.data.bar.chart=ggplot(attribute.availability.by.year.table.prop.data, aes(x=year, y=proportion, fill=attribute.availability))+ geom_bar(stat="identity", position = "fill")
              attribute.availability.by.year.table.prop.data.bar.chart

              #individual line graphs
              attribute.availability.by.year.prop.line.graph<-ggplot(attribute.availability.by.year.table.prop.data, aes(x=year, y=proportion, group=attribute.availability)) +
              geom_line(aes(color=attribute.availability))+
              geom_point(aes(color=attribute.availability))
              print(attribute.availability.by.year.prop.line.graph)

              #Facet by attribute.availability to pull apart and visualize
              Facet.attribute.availability.by.year.prop.line.graph=attribute.availability.by.year.prop.line.graph+facet_wrap(. ~ attribute.availability,ncol=3)
              print(Facet.attribute.availability.by.year.prop.line.graph)

4F) Do we see a trend for the proportion of incidents within the US versus outside of the US?

          #Create new variable marking inside or outside US
          VCDB$US_BOOLEAN=ifelse(VCDB$victim.country=="US", "US", "NON_US")
         
          #Visually graph US vs. non-US victims by year
          ggplot(aes(x = timeline.incident.year ), data = VCDB) + 
          geom_histogram(aes(fill = US_BOOLEAN ), binwidth=1, colour="grey20", lwd=0.2) +
          scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))

          #Generate raw numbers
          victim.country.by.year.table=table(VCDB$timeline.incident.year,VCDB$US_BOOLEAN)
          print(victim.country.by.year.table)
##       
##        NON_US  US
##   2010     10 191
##   2011     21 141
##   2012     31 255
##   2013     69 326
##   2014     54 256
##   2015     76 240
##   2016     63 195
##   2017     59 138
##   2018     13  43
          #Generate proportions with new dataset
          victim.country.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$US_BOOLEAN),1) #percentages by row
          victim.country.by.year.table.prop.data=as.data.frame(victim.country.by.year.table.prop.table)
          victim.country.by.year.table.prop.data=victim.country.by.year.table.prop.data %>% rename(year=Var1, victim.country=Var2, proportion=Freq)
          
          kable(victim.country.by.year.table.prop.data) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
year victim.country proportion
2010 NON_US 0.0497512
2011 NON_US 0.1296296
2012 NON_US 0.1083916
2013 NON_US 0.1746835
2014 NON_US 0.1741935
2015 NON_US 0.2405063
2016 NON_US 0.2441860
2017 NON_US 0.2994924
2018 NON_US 0.2321429
2010 US 0.9502488
2011 US 0.8703704
2012 US 0.8916084
2013 US 0.8253165
2014 US 0.8258065
2015 US 0.7594937
2016 US 0.7558140
2017 US 0.7005076
2018 US 0.7678571
          #Proportion Visualizations
          
          #Proportion bar charts
          victim.country.by.year.table.prop.data.bar.chart=ggplot(victim.country.by.year.table.prop.data, aes(x=year, y=proportion, fill=victim.country)) + geom_bar(stat="identity", 
          position = "fill")
          victim.country.by.year.table.prop.data.bar.chart

          #individual line graphs
          victim.country.by.year.prop.line.graph<-ggplot(victim.country.by.year.table.prop.data, aes(x=year, y=proportion, group=victim.country)) +
            geom_line(aes(color=victim.country))+
            geom_point(aes(color=victim.country))
          print(victim.country.by.year.prop.line.graph)

          #Facet by victim.country to pull apart and visualize
          Facet.victim.country.by.year.prop.line.graph=victim.country.by.year.prop.line.graph+facet_wrap(. ~ victim.country,ncol=3)
          print(Facet.victim.country.by.year.prop.line.graph)
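The upward non-US trend visible in the plots above can also be checked formally. As a supplementary sketch (an addition, not part of the requested analysis), base R's prop.trend.test can be applied to the yearly NON_US counts and incident totals transcribed from the tables above:

```r
# Chi-squared test for trend in proportions across 2010-2018, using the
# NON_US counts and yearly incident totals from the tables above.
non_us <- c(10, 21, 31, 69, 54, 76, 63, 59, 13)
totals <- c(201, 162, 286, 395, 310, 316, 258, 197, 56)
trend  <- prop.trend.test(non_us, totals)
trend$p.value   # a very small p-value indicates a significant monotone trend
```

With these counts the p-value is far below 0.05, consistent with the rise in the non-US proportion described in the executive summary.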

4G) Please feel free to share any other notable findings as you explore the data?

#Explore temporal nature within year by month to look for seasonality  (all data)
          #Display Graphically Ordered Temporally
          ggplot(data=VCDB, aes(x=timeline.incident.month)) +geom_bar(fill = "darkred") +geom_text(stat='count', aes(label=..count..), vjust=-1)  
## Warning: Removed 629 rows containing non-finite values (stat_count).

    #Print Numerical Table
          incident.by.month.table=table(VCDB$timeline.incident.month)
          kable(incident.by.month.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
Var1 Freq
1 126
2 150
3 132
4 123
5 131
6 141
7 148
8 133
9 108
10 114
11 123
12 123
     #Generate proportions with new dataset
          victim.month.by.year.table.prop.table=prop.table(table(VCDB$timeline.incident.year,VCDB$timeline.incident.month),1)   #percentages by row
          victim.month.by.year.table.prop.data=as.data.frame(victim.month.by.year.table.prop.table)
          victim.month.by.year.table.prop.data=victim.month.by.year.table.prop.data %>% rename(year=Var1, timeline.incident.month=Var2, proportion=Freq)
          
          kable(victim.month.by.year.table.prop.data) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
year timeline.incident.month proportion
2010 1 0.0106952
2011 1 0.1138211
2012 1 0.0930233
2013 1 0.1075269
2014 1 0.0534759
2015 1 0.0883978
2016 1 0.0725389
2017 1 0.1034483
2018 1 0.1190476
2010 2 0.1176471
2011 2 0.0813008
2012 2 0.0511628
2013 2 0.1039427
2014 2 0.1283422
2015 2 0.0607735
2016 2 0.1295337
2017 2 0.1103448
2018 2 0.0476190
2010 3 0.0588235
2011 3 0.1138211
2012 3 0.0651163
2013 3 0.0681004
2014 3 0.0962567
2015 3 0.0994475
2016 3 0.1088083
2017 3 0.0689655
2018 3 0.1666667
2010 4 0.0695187
2011 4 0.1056911
2012 4 0.0744186
2013 4 0.0788530
2014 4 0.0802139
2015 4 0.0828729
2016 4 0.0673575
2017 4 0.0896552
2018 4 0.0714286
2010 5 0.0962567
2011 5 0.0487805
2012 5 0.1255814
2013 5 0.0860215
2014 5 0.0481283
2015 5 0.0607735
2016 5 0.0777202
2017 5 0.1241379
2018 5 0.0714286
2010 6 0.0481283
2011 6 0.0650407
2012 6 0.0697674
2013 6 0.0860215
2014 6 0.1069519
2015 6 0.1270718
2016 6 0.0880829
2017 6 0.1172414
2018 6 0.1904762
2010 7 0.0802139
2011 7 0.0813008
2012 7 0.1534884
2013 7 0.0609319
2014 7 0.0588235
2015 7 0.0994475
2016 7 0.1243523
2017 7 0.0689655
2018 7 0.2380952
2010 8 0.0802139
2011 8 0.0650407
2012 8 0.0604651
2013 8 0.1182796
2014 8 0.0962567
2015 8 0.0718232
2016 8 0.0829016
2017 8 0.1034483
2018 8 0.0476190
2010 9 0.1016043
2011 9 0.0487805
2012 9 0.0604651
2013 9 0.0860215
2014 9 0.0802139
2015 9 0.0441989
2016 9 0.0518135
2017 9 0.0827586
2018 9 0.0238095
2010 10 0.0909091
2011 10 0.0406504
2012 10 0.0883721
2013 10 0.0609319
2014 10 0.0802139
2015 10 0.1049724
2016 10 0.0518135
2017 10 0.0827586
2018 10 0.0000000
2010 11 0.1283422
2011 11 0.1138211
2012 11 0.0930233
2013 11 0.0752688
2014 11 0.0855615
2015 11 0.0662983
2016 11 0.0569948
2017 11 0.0275862
2018 11 0.0238095
2010 12 0.1176471
2011 12 0.1219512
2012 12 0.0651163
2013 12 0.0681004
2014 12 0.0855615
2015 12 0.0939227
2016 12 0.0880829
2017 12 0.0206897
2018 12 0.0000000
          #Proportion Visualizations
          
          #Proportion bar charts
          victim.month.by.year.table.prop.data.bar.chart=ggplot(victim.month.by.year.table.prop.data, aes(x=year, y=proportion, fill=timeline.incident.month)) + geom_bar(stat="identity", 
          position = "fill")
          victim.month.by.year.table.prop.data.bar.chart
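As a quick supplementary check (an addition, not part of the requested analysis), a chi-squared goodness-of-fit test on the pooled monthly totals from the incident-by-month table above indicates whether incidents depart from a uniform distribution across months:

```r
# Goodness-of-fit test of uniformity across months, using the pooled
# monthly totals from the incident-by-month table above.
monthly <- c(126, 150, 132, 123, 131, 141, 148, 133, 108, 114, 123, 123)
chisq.test(monthly)   # H0: each month has the same expected incident count
```

Pooled across all years the departure from uniformity is modest (the test does not reject at the 5% level); the seasonality shift is visible only when months are broken out by year, as in the proportion table above.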

#Explore relationship between type of action and amount of records breached
          
          #Display Graphically with quick box plot
          
  records.by.action <- ggplot(VCDB, aes(action, confidentiality.total_record_count, color=action )) +
  stat_boxplot(fill = NA) +
  labs(subtitle = "Plot of Total Record Count by Action Class")
  records.by.action

Modeling Questions

Let’s assume that you work for a large healthcare employer (1001 employees or larger), and you are scoping a risk scenario where you are worried about an insider threat (actor: internal) compromising the confidentiality of medical records. Assuming that you will have a cybersecurity incident this year, based on the data you have can you come up with a model that will estimate, with 90% confidence, the range of the counts of medical records that will be compromised in such an incident (minimum and maximum)? Within that 90% confidence interval, what is the most likely count of the breached records? You may ignore those employers with unknown employee counts for the sake of time. Bonus points if you integrate the year of breach into your model – are the trends changing?

#Subset data to include only insider threats (actor: internal) at large employers (1001 employees or larger).
Internal.actor.data=subset(VCDB, actor=="internal" & victim.employee_count %in% c("1001 to 10000","10001 to 25000", "25001 to 50000", "50001 to 100000","over 100000", "large"))

  #Visual Check
  ggplot(aes(x = timeline.incident.year ), data = Internal.actor.data) + 
  geom_histogram(aes(fill = victim.employee_count ), binwidth=1, colour="darkred", lwd=0.2) +
  scale_x_continuous(breaks=seq(0,max(VCDB$timeline.incident.year), 1))
  #Numerical Table Check
  internal.actor.table=table(Internal.actor.data$victim.employee_count,Internal.actor.data$timeline.incident.year)
  kable(internal.actor.table) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
2010 2011 2012 2013 2014 2015 2016 2017 2018
1 to 10 0 0 0 0 0 0 0 0 0
10001 to 25000 3 2 6 8 6 7 8 9 2
1001 to 10000 17 17 29 37 35 36 25 14 2
101 to 1000 0 0 0 0 0 0 0 0 0
11 to 100 0 0 0 0 0 0 0 0 0
25001 to 50000 0 0 0 0 0 0 0 0 0
50001 to 100000 1 0 1 0 0 1 0 0 0
large 2 1 2 5 4 14 11 3 2
over 100000 0 0 0 0 0 1 1 1 0
small 0 0 0 0 0 0 0 0 0
unknown 0 0 0 0 0 0 0 0 0
  #Confidentiality Records Analysis

  #Visualize counts of confidentiality.medical_records records by breach
  ggplot(aes(x = confidentiality.medical_records ), data = Internal.actor.data) + 
  geom_histogram(aes(fill = actor ), binwidth=100, colour="darkred", lwd=0.2)
    #Numerical Table Check

    internal.actor.table.confidentiality.medical.records=table(Internal.actor.data$confidentiality.medical_records)
    kable(internal.actor.table.confidentiality.medical.records) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
Var1 Freq
0 138
1 31
2 7
3 2
7 1
8 3
9 2
14 3
19 1
22 1
25 1
34 1
52 1
60 1
74 1
92 1
100 2
104 1
111 1
112 1
122 1
141 1
142 1
144 1
164 1
197 1
201 1
246 1
280 1
297 1
397 1
400 1
430 1
453 1
480 1
500 1
502 1
515 1
528 1
568 1
586 1
600 1
606 1
610 1
642 1
648 1
652 1
677 1
697 1
772 1
827 1
833 1
844 1
874 1
878 1
897 1
967 1
1000 1
1029 1
1064 1
1087 1
1100 1
1126 1
1136 1
1185 1
1215 1
1290 1
1300 1
1310 1
1382 1
1400 1
1537 1
1655 1
1726 1
1740 1
1745 1
1782 1
1800 3
1801 1
1845 1
1920 1
2027 1
2159 1
2264 1
2284 1
2300 1
2400 1
2600 2
2700 1
2777 1
2928 1
3200 1
3247 1
3300 1
3334 1
3598 1
3800 4
3900 1
4721 1
4879 1
5000 1
5103 1
5400 1
5511 1
6000 1
6307 1
6800 1
6923 1
7445 1
7500 1
8020 1
12600 1
14000 1
19750 1
20000 2
24188 1
24750 1
33000 1
40000 1
42000 1
55900 1
83945 1
113528 1
115143 1
315000 1
1055489 1
  #obtain min, max, mode values for PERT distribution analysis
  min.pert=min(Internal.actor.data$confidentiality.medical_records)
  max.pert=max(Internal.actor.data$confidentiality.medical_records)
      #create mode function and then calculate mode
      getmode <- function(v) {
         uniqv <- unique(v)
          uniqv[which.max(tabulate(match(v, uniqv)))]
      }
      mode.pert=getmode(Internal.actor.data$confidentiality.medical_records)
  #set number of runs for analysis
      num.run.pert=10000
      
  #Generate PERT Statistics 
        #NOTE: This will generate a distribution with a level of randomness, so exact values will differ slightly with each "run" or "knit"
      pert.confidentiality.medical_records=rpert(num.run.pert,min.pert,mode.pert,max.pert, shape=4)
      pert.data=data.frame(pert.confidentiality.medical_records)
      
  #Plot Distribution of Pert Statistics
      gg <- ggplot(pert.data, aes(x = pert.confidentiality.medical_records))
      gg <- gg + geom_histogram(aes(y = ..density..),
                          color="black", 
                          fill = "white", 
                          binwidth = 100)
      gg <- gg + geom_density(fill = "darkred", alpha = 1/3)
      gg <- gg + theme_bw()
      gg

      #Generate Summary Statistics for Answer
      mode.pert.result=getmode(pert.data$pert.confidentiality.medical_records)
      confidence.interval.pert=quantile(pert.data$pert.confidentiality.medical_records, c(0.05,0.95)) 
      print(mode.pert.result)
## [1] 333312.8
      print(confidence.interval.pert)
##        5%       95% 
##  11032.49 486653.49
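Because each knit redraws the PERT sample, the reported mode and interval wander slightly between runs. A reproducibility sketch follows (an addition; the seed value and the base-R reimplementation are illustrative assumptions, not part of the original analysis), using the standard modified-PERT-as-scaled-Beta parameterization with shape = 4, as in the rpert() call above:

```r
# Modified PERT draw expressed as a scaled Beta; fixing the RNG seed makes
# every knit produce the identical interval.
rpert_base <- function(n, min, mode, max, shape = 4) {
  a <- 1 + shape * (mode - min) / (max - min)   # Beta alpha
  b <- 1 + shape * (max - mode) / (max - min)   # Beta beta
  min + (max - min) * rbeta(n, a, b)
}
set.seed(42)                                 # arbitrary, but fixed
# min, mode, max from the internal-actor medical-records data above
# (the mode is 0 because most incidents report 0 medical records)
sims <- rpert_base(10000, 0, 0, 1055489)
quantile(sims, c(0.05, 0.95))                # reproducible 90% interval
```

Note also that applying a discrete mode function such as getmode to continuous draws effectively returns an arbitrary sample value; a density-based estimate (e.g., the peak of density(sims)) would be a more defensible "most likely" value.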
      #Examine number of cases of confidentiality medical records over time and generate 90% confidence intervals
          #Utilize Poisson Distribution Due to Count Data 
      
      confidentiality.medical_records.over.time=glm(confidentiality.medical_records~timeline.incident.year,data=Internal.actor.data, 
      family=poisson)
      summary(confidentiality.medical_records.over.time)
## 
## Call:
## glm(formula = confidentiality.medical_records ~ timeline.incident.year, 
##     family = poisson, data = Internal.actor.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -241.39  -114.98   -70.13   -42.77  2558.58  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             1.004e+03  7.496e-01    1340   <2e-16 ***
## timeline.incident.year -4.945e-01  3.726e-04   -1327   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 15243883  on 312  degrees of freedom
## Residual deviance: 13175182  on 311  degrees of freedom
## AIC: 13176462
## 
## Number of Fisher Scoring iterations: 8
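One caution about the fit above: the residual deviance (13,175,182 on 311 degrees of freedom) is orders of magnitude larger than its degrees of freedom, a strong sign of overdispersion, so the Poisson standard errors (and the <2e-16 p-values) are likely too optimistic. A sketch on simulated data (not the VCDB data) of how the dispersion ratio can be checked and how `family = quasipoisson` widens the standard errors accordingly:

```r
# Simulate overdispersed counts (negative binomial: variance >> mean)
set.seed(1)
x <- rep(2010:2018, each = 30)
y <- rnbinom(length(x), mu = exp(5 - 0.1 * (x - 2010)), size = 0.5)
toy <- data.frame(x, y)

fit.pois <- glm(y ~ x, data = toy, family = poisson)

# Pearson dispersion ratio: ~1 for a well-specified Poisson, >> 1 here
dispersion <- sum(residuals(fit.pois, type = "pearson")^2) / df.residual(fit.pois)

# quasipoisson keeps the same point estimates but scales the standard
# errors by sqrt(dispersion), giving more honest inference
fit.quasi <- glm(y ~ x, data = toy, family = quasipoisson)
se.ratio <- summary(fit.quasi)$coefficients["x", "Std. Error"] /
            summary(fit.pois)$coefficients["x", "Std. Error"]
c(dispersion = dispersion, se.ratio = se.ratio)  # se.ratio equals sqrt(dispersion)
```

A negative binomial model (`MASS::glm.nb`) would be another standard option for count data this dispersed.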
      df_ints <- Internal.actor.data%>% 
      add_ci(confidentiality.medical_records.over.time, names = c("lcb", "ucb"), alpha = 0.1) %>%
      add_pi(confidentiality.medical_records.over.time, names = c("lpb", "upb"), alpha = 0.1, nSims = 20000) %>%
      print()
## Warning in add_pi.glm(., confidentiality.medical_records.over.time, names
## = c("lpb", : The response is not continuous, so Prediction Intervals are
## approximate
## # A tibble: 313 x 28
##    incident_id timeline.incide… timeline.incide… timeline.incide…
##  * <fct>                  <dbl>            <dbl> <fct>           
##  1 B124687D-1…               27                8 ""              
##  2 89A98591-D…               NA               NA ""              
##  3 1c0cdc80-e…                1                3 ""              
##  4 29951776-2…               NA               NA ""              
##  5 ec902e80-d…               NA               NA ""              
##  6 CFC5936D-D…               NA                7 ""              
##  7 D53164B6-B…                8                3 ""              
##  8 d4359e90-d…               NA                2 ""              
##  9 4765F3B9-7…               NA                3 ""              
## 10 F4CF0FC8-8…                1               10 ""              
## # ... with 303 more rows, and 24 more variables:
## #   timeline.incident.year <int>, actor <fct>, action <fct>,
## #   attribute.confidentiality <fct>, attribute.integrity <fct>,
## #   attribute.availability <fct>, asset <fct>, asset.variety <fct>,
## #   confidentiality.medical_records <dbl>,
## #   confidentiality.payment_records <dbl>,
## #   confidentiality.personal_records <dbl>,
## #   confidentiality.total_record_count <dbl>, victim.employee_count <fct>,
## #   victim.state <fct>, victim.country <fct>, victim.victim_id <fct>,
## #   summary <fct>, reference <fct>, US_BOOLEAN <chr>, pred <dbl>,
## #   lcb <dbl>, ucb <dbl>, lpb <int>, upb <int>
      df_ints %>% 
      ggplot(aes(x = timeline.incident.year, y = confidentiality.medical_records)) +
      geom_point(size = 2) +
      ggtitle("Poisson Regression", subtitle = "Model fit (black line), with prediction intervals (gray), confidence intervals (dark gray)") +
      geom_line(aes(x = timeline.incident.year, y = pred), size = 1.2) + 
      geom_ribbon(aes(ymin = lcb, ymax = ucb), alpha = 0.4) +
      geom_ribbon(aes(ymin = lpb, ymax = upb), alpha = 0.2)

      #Outlier spotted; rerun the analysis without the outlier
          #subset to incidents with <10,000 records, a cutoff chosen by visual inspection
      Internal.actor.data.no.outlier=subset(Internal.actor.data, confidentiality.medical_records<10000)
       confidentiality.medical_records.over.time.no.outlier=glm(confidentiality.medical_records~timeline.incident.year,data=Internal.actor.data.no.outlier, 
      family=poisson)
      summary(confidentiality.medical_records.over.time.no.outlier)
## 
## Call:
## glm(formula = confidentiality.medical_records ~ timeline.incident.year, 
##     family = poisson, data = Internal.actor.data.no.outlier)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -45.697  -36.203  -32.223   -2.758  170.022  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            241.017242   2.178697   110.6   <2e-16 ***
## timeline.incident.year  -0.116451   0.001082  -107.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 592309  on 296  degrees of freedom
## Residual deviance: 580782  on 295  degrees of freedom
## AIC: 581860
## 
## Number of Fisher Scoring iterations: 7
      df_ints <- Internal.actor.data.no.outlier %>% 
      add_ci(confidentiality.medical_records.over.time.no.outlier, names = c("lcb", "ucb"), alpha = 0.1) %>%
      add_pi(confidentiality.medical_records.over.time.no.outlier, names = c("lpb", "upb"), alpha = 0.1, nSims = 20000) %>%
      print()
## Warning in add_pi.glm(., confidentiality.medical_records.over.time.no.outlier,
## names = c("lpb", : The response is not continuous, so Prediction Intervals are
## approximate
## # A tibble: 297 x 28
##    incident_id timeline.incide… timeline.incide… timeline.incide…
##  * <fct>                  <dbl>            <dbl> <fct>           
##  1 B124687D-1…               27                8 ""              
##  2 89A98591-D…               NA               NA ""              
##  3 1c0cdc80-e…                1                3 ""              
##  4 29951776-2…               NA               NA ""              
##  5 ec902e80-d…               NA               NA ""              
##  6 D53164B6-B…                8                3 ""              
##  7 d4359e90-d…               NA                2 ""              
##  8 4765F3B9-7…               NA                3 ""              
##  9 98ED15ED-5…               NA               12 ""              
## 10 9ac51270-d…               25               10 ""              
## # ... with 287 more rows, and 24 more variables:
## #   timeline.incident.year <int>, actor <fct>, action <fct>,
## #   attribute.confidentiality <fct>, attribute.integrity <fct>,
## #   attribute.availability <fct>, asset <fct>, asset.variety <fct>,
## #   confidentiality.medical_records <dbl>,
## #   confidentiality.payment_records <dbl>,
## #   confidentiality.personal_records <dbl>,
## #   confidentiality.total_record_count <dbl>, victim.employee_count <fct>,
## #   victim.state <fct>, victim.country <fct>, victim.victim_id <fct>,
## #   summary <fct>, reference <fct>, US_BOOLEAN <chr>, pred <dbl>,
## #   lcb <dbl>, ucb <dbl>, lpb <int>, upb <int>
      df_ints %>% 
      ggplot(aes(x = timeline.incident.year, y = confidentiality.medical_records)) +
      geom_point(size = 2) +
      ggtitle("Poisson Regression", subtitle = "Model fit (black line), with prediction intervals (gray), confidence intervals (dark gray)") +
      geom_line(aes(x = timeline.incident.year, y = pred), size = 1.2) + 
      geom_ribbon(aes(ymin = lcb, ymax = ucb), alpha = 0.4) +
      geom_ribbon(aes(ymin = lpb, ymax = upb), alpha = 0.2)
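One caveat with the `subset()` step above: `subset()` keeps only rows where the condition evaluates to TRUE, so incidents with a missing record count (where the condition is NA) are silently dropped along with the outlier. A toy illustration with hypothetical values, not the VCDB data:

```r
d <- data.frame(records = c(500, NA, 25000))

# records < 10000 evaluates to TRUE, NA, FALSE; subset() keeps TRUE rows only,
# so both the NA row and the large-count row are excluded
nrow(subset(d, records < 10000))                    # 1

# to retain rows with a missing count, include them in the condition explicitly
nrow(subset(d, records < 10000 | is.na(records)))   # 2
```

Whether the NA rows should stay depends on the question; here it simply means the no-outlier model is fit on slightly fewer rows than a reader might expect.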

 #Total Records Analysis
     
       #Visualize counts of confidentiality.total_record_count records by breach
  #NOTE: scale_x_continuous() must be chained onto the plot with "+"; on its own
  #line it just prints the scale object instead of modifying the plot
  ggplot(aes(x = confidentiality.total_record_count), data = Internal.actor.data) +
    geom_histogram(aes(fill = actor), binwidth = 100, colour = "darkred", lwd = 0.2) +
    scale_x_continuous(breaks = pretty(Internal.actor.data$confidentiality.total_record_count))
    #Numerical Table Check

    internal.actor.table.confidentiality.total_record_count=table(Internal.actor.data$confidentiality.total_record_count)
    kable(internal.actor.table.confidentiality.total_record_count, col.names = c("Records", "Freq")) %>%
          kable_styling(bootstrap_options = c("striped", "hover"))
Records Freq
0 78
1 37
2 10
3 3
4 1
7 1
8 4
9 2
11 1
14 4
19 1
20 1
21 1
22 1
25 1
31 1
34 1
44 1
45 1
48 1
49 1
50 2
52 1
57 1
60 1
74 1
92 1
100 3
104 1
111 1
112 1
122 1
124 1
141 1
142 1
144 1
158 1
164 1
197 1
200 1
201 1
246 1
250 1
261 1
280 1
297 1
397 1
400 2
407 1
430 1
453 1
480 1
500 1
502 1
515 1
520 1
528 1
531 1
568 1
586 1
600 1
606 1
610 1
642 1
648 1
652 1
674 1
677 1
697 1
772 1
777 1
781 1
827 1
833 1
844 1
848 1
874 1
878 1
897 1
900 1
957 1
967 1
1000 1
1029 1
1062 1
1064 1
1087 1
1100 1
1124 1
1126 1
1136 1
1185 1
1215 1
1290 1
1300 1
1310 1
1382 1
1399 1
1400 1
1445 1
1537 1
1655 1
1726 1
1740 1
1745 1
1782 1
1800 3
1801 1
1845 1
1920 1
2006 1
2027 1
2159 1
2200 1
2264 1
2284 1
2300 2
2318 1
2400 2
2600 2
2700 1
2777 1
2826 1
2928 1
3200 2
3247 1
3300 1
3334 1
3403 1
3598 1
3800 4
3900 1
4721 1
4879 1
5000 1
5103 1
5261 1
5300 1
5400 1
5511 1
6000 1
6307 1
6800 1
6923 1
7445 1
7500 1
8020 1
12061 1
12600 1
14000 1
14121 1
19750 1
20000 2
24188 1
24750 1
33000 1
40000 1
42000 1
43000 1
44000 1
55900 1
83945 1
108000 1
113528 1
115143 1
315000 1
655550 1
1055489 1
  #obtain min, max, mode values for PERT distribution analysis
  min.pert.total=min(Internal.actor.data$confidentiality.total_record_count)
  max.pert.total=max(Internal.actor.data$confidentiality.total_record_count)
      #create mode function and then calculate mode
      #reuse the getmode() helper defined in the medical-records section above
      mode.pert.total=getmode(Internal.actor.data$confidentiality.total_record_count)
  #set number of runs for analysis
      num.run.pert.total=10000
      
  #Generate PERT Statistics 
       #NOTE: This will generate a distribution with a level of randomness, so exact values will differ slightly with each "run" or "knit".
      pert.confidentiality.total_record_count=rpert(num.run.pert.total,min.pert.total,mode.pert.total,max.pert.total, shape=4)
      pert.data.total=data.frame(pert.confidentiality.total_record_count)
      
  #Plot Distribution of Pert Statistics
      gg.total <- ggplot(pert.data.total, aes(x = pert.confidentiality.total_record_count))
      gg.total <- gg.total + geom_histogram(aes(y = ..density..),
                          color="black", 
                          fill = "white", 
                          binwidth = 100)
      gg.total <- gg.total + geom_density(fill = "darkred", alpha = 1/3)
      gg.total <- gg.total + theme_bw()
      gg.total

      #Generate Summary Statistics for Answer
      mode.pert.result.total=getmode(pert.data.total$pert.confidentiality.total_record_count)
      confidence.interval.pert.total=quantile(pert.data.total$pert.confidentiality.total_record_count, c(0.05,0.95))
      print(mode.pert.result.total)
## [1] 155768.9
      print(confidence.interval.pert.total)
##        5%       95% 
##  10804.18 476211.99
#Generate WordCloud from Text (This section takes a while to process)
  #Convert the summary text vector into a corpus for the wordcloud and generate a table of word frequencies
ap.corpus <- Corpus(VectorSource(VCDB$summary))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(ap.corpus, removePunctuation):
## transformation drops documents
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(ap.corpus, content_transformer(tolower)):
## transformation drops documents
ap.corpus <- tm_map(ap.corpus, function(x) removeWords(x, stopwords("english")))
## Warning in tm_map.SimpleCorpus(ap.corpus, function(x) removeWords(x,
## stopwords("english"))): transformation drops documents
  #re-wrap the transformed text as a fresh corpus before building the term-document matrix
ap.corpus <- Corpus(VectorSource(ap.corpus))
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 6211 1923 1016  639  482  330  213  198  176  133  119  116   84   87   70 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
##   74   58   49   51   44   37   27   33   25   30   29   26   20   35   27 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##   16   21   26   15   16   18   22   17   21   14    8   10    7    7    5 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##    8    8   13   12    5   11    9   10   13    8    8    8    7    2    8 
##   61   62   63   64   65   66   67   68   69   70   71   72   73   74   75 
##    9   10    3    7    5    7   11    2    4    5    6    1    4    8    8 
##   76   77   78   79   80   81   82   83   84   85   86   87   88   89   90 
##    5    6    2    3    2    5    6    3    4    7    2    4    3    2    4 
##   91   92   93   94   95   96   97   98   99  100  101  102  103  104  105 
##    5    6    3    3    2    1    3    2    4    5    3    1    2    2    2 
##  106  107  109  110  111  112  113  114  115  116  119  120  122  123  126 
##    1    4    2    2    1    2    2    2    2    2    4    3    1    1    1 
##  127  128  130  131  132  134  135  137  138  139  140  141  144  145  148 
##    2    1    1    2    1    2    1    1    1    2    2    3    1    1    3 
##  151  152  153  154  155  156  160  162  163  164  165  167  168  169  171 
##    2    1    1    1    1    1    2    1    1    1    3    1    1    1    1 
##  172  175  176  178  181  188  191  196  202  205  207  209  212  213  214 
##    1    2    1    1    1    1    1    1    1    1    1    1    1    2    1 
##  218  222  226  228  230  237  239  244  249  250  260  265  267  268  269 
##    2    1    2    2    3    1    1    1    1    1    1    1    1    1    1 
##  276  281  286  307  309  321  329  338  340  345  348  361  364  365  366 
##    2    1    1    1    1    1    1    1    1    1    3    1    1    1    1 
##  369  371  372  374  392  394  401  409  410  415  437  443  456  471  504 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  516  527  644  703  836  837  846  920 1021 1154 1319 1424 1488 1773 2696 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
pal2 <- brewer.pal(8,"Dark2")
  #Generate figure as PNG (saves to the working directory)
png("wordcloud_packages.png", width=1280,height=800)
wordcloud(ap.d$word,ap.d$freq, scale=c(8,.2),min.freq=3,
          max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)
dev.off()
## quartz_off_screen 
##                 2